{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# An Introduction to Using [Pyserini](https://pyserini.io/) for DSPy\n",
    "\n",
    "Pyserini is a tool maintained by the Data Systems Group at the University of Waterloo, and you can use it to incorporate your own data into `dspy.Retrieve`. Currently, `dspy.Pyserini` supports using your own Faiss index one of pyserini's [prebuilt indexes](https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py) to perform retrieval."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Installation/Setup\n",
    "Using `dspy.Pyserini` will require installing pyserini, Pytorch, and Faiss. Pyserini can be installed with `pip install pyserini`, and if you're on your own device we'll leave it to you to decide the right versions of Pytorch and Faiss to install.\n",
    "\n",
    "On Colab, make sure to run this notebook with GPU by going to Edit > Notebook Settings > Select a GPU under Hardware Accelerator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_15686/3452029479.py:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html\n",
      "  import pkg_resources\n"
     ]
    }
   ],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "import sys\n",
    "import pkg_resources \n",
    "\n",
    "try: # When on Colab, let's install pyserini, Pytorch, and Faiss\n",
    "    import google.colab\n",
    "    repo_path = 'dspy'\n",
    "    !git -C $repo_path pull origin || git clone https://github.com/stanfordnlp/dspy $repo_path\n",
    "    %cd $repo_path\n",
    "    !pip install -e .\n",
    "    if not \"pyserini\" in {pkg.key for pkg in pkg_resources.working_set}:\n",
    "        !pip install pyserini\n",
    "    if not \"torch\" in {pkg.key for pkg in pkg_resources.working_set}:\n",
    "        !pip install torch\n",
    "    if not \"faiss-cpu\" in {pkg.key for pkg in pkg_resources.working_set}:\n",
    "        !pip install faiss-cpu\n",
    "except:\n",
    "    repo_path = '.'\n",
    "    # Install the package if it's not installed\n",
    "    if not \"dspy-ai\" in {pkg.key for pkg in pkg_resources.working_set}:\n",
    "        !pip install -U pip\n",
    "        !pip install dspy-ai\n",
    "        # !pip install -e $repo_path\n",
    "\n",
    "if repo_path not in sys.path:\n",
    "    sys.path.append(repo_path)\n",
    "\n",
    "import dspy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Using Pyserini's prebuilt indexes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Attempting to initialize pre-built index beir-v1.0.0-nfcorpus.contriever-msmarco.\n",
      "/store2/scratch/j5xian/cache/pyserini/indexes/faiss.beir-v1.0.0-nfcorpus.contriever-msmarco.20230124.657649d19fafd06cb031c6b11868d7f9 already exists, skipping download.\n",
      "Initializing beir-v1.0.0-nfcorpus.contriever-msmarco...\n",
      "Top 3 passages for question: How Curry Can Kill Cancer Cells \n",
      " ------------------------------ \n",
      "\n",
      "1] Curcumin and Cancer Cells: How Many Ways Can Curry Kill Tumor Cells Selectively? Cancer is a hyperproliferative disorder that is usually treated by chemotherapeutic agents that are toxic not only to tumor cells but also to normal cells, so these agents produce major side effects. In addition, these agents are highly expensive and thus not affordable for most. Moreover, such agents cannot be used for cancer prevention. Traditional medicines are generally free of the deleterious side effects and usually inexpensive. Curcumin, a component of turmeric (Curcuma longa), is one such agent that is safe, affordable, and efficacious. How curcumin kills tumor cells is the focus of this review. We show that curcumin modulates growth of tumor cells through regulation of multiple cell signaling pathways including cell proliferation pathway (cyclin D1, c-myc), cell survival pathway (Bcl-2, Bcl-xL, cFLIP, XIAP, c-IAP1), caspase activation pathway (caspase-8, 3, 9), tumor suppressor pathway (p53, p21) death receptor pathway (DR4, DR5), mitochondrial pathways, and protein kinase pathway (JNK, Akt, and AMPK). How curcumin selectively kills tumor cells, and not normal cells, is also described in detail. \n",
      "\n",
      "2] Triphala, Ayurvedic formulation for treating and preventing cancer: a review. BACKGROUND: Triphala (Sanskrit tri = three and phala = fruits), composed of the three medicinal fruits Phyllanthus emblica L. or Emblica officinalis Gaertn., Terminalia chebula Retz., and Terminalia belerica Retz. is an important herbal preparation in the traditional Indian system of medicine, Ayurveda. Triphala is an antioxidant-rich herbal formulation and possesses diverse beneficial properties. It is a widely prescribed Ayurvedic drug and is used as a colon cleanser, digestive, diuretic, and laxative. Cancer is a major cause of death, and globally studies are being conducted to prevent cancer or to develop effective nontoxic therapeutic agents. Experimental studies in the past decade have shown that Triphala is useful in the prevention of cancer and that it also possesses antineoplastic, radioprotective and chemoprotective effects. CONCLUSIONS: This review for the first time summarizes these results, with emphasis on published observations. Furthermore, the possible mechanisms responsible for the beneficial effects and lacunas in the existing knowledge that need to be bridged are also discussed. \n",
      "\n",
      "3] Dietary turmeric potentially reduces the risk of cancer. Turmeric, a plant rhizome that is often dried, ground and used as a cooking spice, has also been used medicinally for several thousand years. Curcumin, the phytochemical that gives turmeric its golden color, is responsible for most of the therapeutic effects of turmeric. In recent years curcumin has been studied for its effects on chronic diseases such as diabetes, Alzheimer's, and cancer. Though many researchers are investigating turmeric/curcumin in cancer therapy, there is little epidemiologic information on the effects of turmeric consumption. With limited availability of pharmacologic interventions in many areas of the world, use of turmeric in the diet may help to alleviate some of the disease burden through prevention. Here we provide a brief overview of turmeric consumption in different parts of the world, cancer rates in those regions, possible biochemical mechanisms by which turmeric acts and practical recommendations based on the information available. \n",
      "\n"
     ]
    }
   ],
   "source": [
    "pys_ret_prebuilt = dspy.Pyserini(index='beir-v1.0.0-nfcorpus.contriever-msmarco', query_encoder='facebook/contriever-msmarco', id_field='_id', text_fields=['title', 'text'])\n",
    "\n",
    "dspy.settings.configure(rm=pys_ret_prebuilt)\n",
    "\n",
    "example_question = \"How Curry Can Kill Cancer Cells\"\n",
    "\n",
    "retrieve = dspy.Retrieve(k=3)\n",
    "topK_passages = retrieve(example_question).passages\n",
    "\n",
    "print(f\"Top {retrieve.k} passages for question: {example_question} \\n\", '-' * 30, '\\n')\n",
    "\n",
    "for idx, passage in enumerate(topK_passages):\n",
    "    print(f'{idx+1}]', passage, '\\n')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Using your own data\n",
    "As an example, we'll be using [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/), a full-text learning to rank dataset for medical information retrieval. This corpus is quite small so encoding, indexing, and retrieval should be tolerable on CPU. \n",
    "\n",
    "First, let's fetch the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
      "To disable this warning, you can either:\n",
      "\t- Avoid using `tokenizers` before the fork if possible\n",
      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
      "--2023-09-16 12:33:31--  https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip\n",
      "Resolving public.ukp.informatik.tu-darmstadt.de (public.ukp.informatik.tu-darmstadt.de)... 130.83.167.186\n",
      "Connecting to public.ukp.informatik.tu-darmstadt.de (public.ukp.informatik.tu-darmstadt.de)|130.83.167.186|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 2448432 (2.3M) [application/zip]\n",
      "Saving to: ‘collections/nfcorpus.zip’\n",
      "\n",
      "nfcorpus.zip        100%[===================>]   2.33M  2.64MB/s    in 0.9s    \n",
      "\n",
      "2023-09-16 12:33:33 (2.64 MB/s) - ‘collections/nfcorpus.zip’ saved [2448432/2448432]\n",
      "\n",
      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
      "To disable this warning, you can either:\n",
      "\t- Avoid using `tokenizers` before the fork if possible\n",
      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
      "/bin/bash: unzip: command not found\n"
     ]
    }
   ],
   "source": [
    "!wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip -P collections\n",
    "!unzip collections/nfcorpus.zip -d collections"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we can use pyserini to encode and pack up our data into a Faiss index:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
      "To disable this warning, you can either:\n",
      "\t- Avoid using `tokenizers` before the fork if possible\n",
      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
      "3633it [00:00, 78339.56it/s]\n",
      "100%|███████████████████████████████████████████| 57/57 [00:17<00:00,  3.33it/s]\n"
     ]
    }
   ],
   "source": [
    "!python -m pyserini.encode \\\n",
    "  input   --corpus collections/nfcorpus/corpus.jsonl \\\n",
    "          --fields title text \\\n",
    "  output  --embeddings indexes/faiss.nfcorpus.contriever-msmarco \\\n",
    "          --to-faiss \\\n",
    "  encoder --encoder facebook/contriever-msmarco \\\n",
    "          --device cuda:0 \\\n",
    "          --pooling mean \\\n",
    "          --fields title text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can use `dspy.Pyserini` to read our local Faiss index and perform retrieval. Note that using a local index requires passing in a Huggingface `Dataset` for document lookup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Top 3 passages for question: How Curry Can Kill Cancer Cells \n",
      " ------------------------------ \n",
      "\n",
      "1] Curcumin and Cancer Cells: How Many Ways Can Curry Kill Tumor Cells Selectively? Cancer is a hyperproliferative disorder that is usually treated by chemotherapeutic agents that are toxic not only to tumor cells but also to normal cells, so these agents produce major side effects. In addition, these agents are highly expensive and thus not affordable for most. Moreover, such agents cannot be used for cancer prevention. Traditional medicines are generally free of the deleterious side effects and usually inexpensive. Curcumin, a component of turmeric (Curcuma longa), is one such agent that is safe, affordable, and efficacious. How curcumin kills tumor cells is the focus of this review. We show that curcumin modulates growth of tumor cells through regulation of multiple cell signaling pathways including cell proliferation pathway (cyclin D1, c-myc), cell survival pathway (Bcl-2, Bcl-xL, cFLIP, XIAP, c-IAP1), caspase activation pathway (caspase-8, 3, 9), tumor suppressor pathway (p53, p21) death receptor pathway (DR4, DR5), mitochondrial pathways, and protein kinase pathway (JNK, Akt, and AMPK). How curcumin selectively kills tumor cells, and not normal cells, is also described in detail. \n",
      "\n",
      "2] Curcumin and cancer: an \"old-age\" disease with an \"age-old\" solution. Cancer is primarily a disease of old age, and that life style plays a major role in the development of most cancers is now well recognized. While plant-based formulations have been used to treat cancer for centuries, current treatments usually involve poisonous mustard gas, chemotherapy, radiation, and targeted therapies. While traditional plant-derived medicines are safe, what are the active principles in them and how do they mediate their effects against cancer is perhaps best illustrated by curcumin, a derivative of turmeric used for centuries to treat a wide variety of inflammatory conditions. Curcumin is a diferuloylmethane derived from the Indian spice, turmeric (popularly called \"curry powder\") that has been shown to interfere with multiple cell signaling pathways, including cell cycle (cyclin D1 and cyclin E), apoptosis (activation of caspases and down-regulation of antiapoptotic gene products), proliferation (HER-2, EGFR, and AP-1), survival (PI3K/AKT pathway), invasion (MMP-9 and adhesion molecules), angiogenesis (VEGF), metastasis (CXCR-4) and inflammation (NF-kappaB, TNF, IL-6, IL-1, COX-2, and 5-LOX). The activity of curcumin reported against leukemia and lymphoma, gastrointestinal cancers, genitourinary cancers, breast cancer, ovarian cancer, head and neck squamous cell carcinoma, lung cancer, melanoma, neurological cancers, and sarcoma reflects its ability to affect multiple targets. Thus an \"old-age\" disease such as cancer requires an \"age-old\" treatment. \n",
      "\n",
      "3] Triphala, Ayurvedic formulation for treating and preventing cancer: a review. BACKGROUND: Triphala (Sanskrit tri = three and phala = fruits), composed of the three medicinal fruits Phyllanthus emblica L. or Emblica officinalis Gaertn., Terminalia chebula Retz., and Terminalia belerica Retz. is an important herbal preparation in the traditional Indian system of medicine, Ayurveda. Triphala is an antioxidant-rich herbal formulation and possesses diverse beneficial properties. It is a widely prescribed Ayurvedic drug and is used as a colon cleanser, digestive, diuretic, and laxative. Cancer is a major cause of death, and globally studies are being conducted to prevent cancer or to develop effective nontoxic therapeutic agents. Experimental studies in the past decade have shown that Triphala is useful in the prevention of cancer and that it also possesses antineoplastic, radioprotective and chemoprotective effects. CONCLUSIONS: This review for the first time summarizes these results, with emphasis on published observations. Furthermore, the possible mechanisms responsible for the beneficial effects and lacunas in the existing knowledge that need to be bridged are also discussed. \n",
      "\n"
     ]
    }
   ],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "dataset = load_dataset(path='json', data_files='collections/nfcorpus/corpus.jsonl', split='train')\n",
    "\n",
    "pys_ret_local = dspy.Pyserini(index='indexes/faiss.nfcorpus.contriever-msmarco', query_encoder='facebook/contriever-msmarco', dataset=dataset, id_field='_id', text_fields=['title', 'text'])\n",
    "\n",
    "dspy.settings.configure(rm=pys_ret_local)\n",
    "\n",
    "dev_example = \"How Curry Can Kill Cancer Cells\"\n",
    "\n",
    "retrieve = dspy.Retrieve(k=3)\n",
    "topK_passages = retrieve(dev_example).passages\n",
    "\n",
    "print(f\"Top {retrieve.k} passages for question: {dev_example} \\n\", '-' * 30, '\\n')\n",
    "\n",
    "for idx, passage in enumerate(topK_passages):\n",
    "    print(f'{idx+1}]', passage, '\\n')\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dspy-pys",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
